The wine quality dataset was created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) in 2009, using red wine samples. The inputs are objective tests (e.g. pH values) and the output is based on sensory data (the median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).
## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
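The outputs above look like the results of R's standard inspection functions. The sketch below shows one way such output could be produced. It is an illustration, not the author's code: the data frame name `wqr` is taken from later output in this report, the file name is not given anywhere, and so the example rebuilds only the first ten rows of three columns from the `str()` listing.

```r
# First ten rows of three columns, copied from the str() output above.
# In the full analysis the data frame would come from read.csv() on the
# dataset file, whose name is not stated in this report.
wqr <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wqr)                     # number of rows and columns
names(wqr)                   # column names
str(wqr)                     # type and first values of each column
summary(wqr)                 # min, quartiles, mean and max per column
quality_counts <- table(wqr$quality)  # number of wines per quality score
quality_counts
summary(wqr$quality)         # summary of a single column
summary(wqr$fixed.acidity)
```

On the full dataset the same calls would report 1599 rows; here they only describe the ten-row excerpt.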
Many of the variables look normally distributed. Chlorides, sulphates, alcohol, free sulfur dioxide and total sulfur dioxide are right-skewed and look closer to lognormal. Let’s exclude values above the 95th percentile for these five features and re-plot their histograms:
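A minimal sketch of this trimming step, assuming it means dropping every value above the 95th percentile of a feature. The helper `trim_upper` and the synthetic `chlorides_demo` vector are hypothetical and only stand in for the real columns.

```r
# Hypothetical helper: keep only the values at or below the p-th quantile.
trim_upper <- function(x, p = 0.95) {
  cutoff <- quantile(x, p, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# Synthetic right-skewed data, used only for illustration.
set.seed(1)
chlorides_demo <- rlnorm(1000, meanlog = log(0.08), sdlog = 0.4)

trimmed <- trim_upper(chlorides_demo)

# Bin the trimmed values; plot(h) would draw the histogram.
h <- hist(trimmed, breaks = 30, plot = FALSE)
length(chlorides_demo) - length(trimmed)   # number of values removed
```

Trimming only hides the long tail in the plot; the removed observations are still part of the dataset.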
Number of red wine instances: 1599
Number of attributes: 1 serial number + 11 attributes + 1 output attribute
11 Attributes:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Quality is the main feature.
Residual sugar, fixed acidity, pH, density and alcohol content may help support the investigation into the quality.
No, I didn’t.
Chlorides, total sulfur dioxide, free sulfur dioxide, sulphates and alcohol all appeared to be long-tailed. After a log transformation, each of them looked approximately normal.
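The effect of a log transformation on a long-tailed feature can be checked numerically as well as visually. The sketch below is an illustration on synthetic data, not the author's code; the `skewness` helper and the `sulphates_demo` vector are assumptions made for the example.

```r
# Hypothetical helper: sample skewness (third standardized moment).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Synthetic long-tailed data, used only for illustration.
set.seed(2)
sulphates_demo <- rlnorm(1000, meanlog = log(0.62), sdlog = 0.5)

skew_raw <- skewness(sulphates_demo)          # clearly positive: long right tail
skew_log <- skewness(log10(sulphates_demo))   # close to zero: roughly symmetric

# Bin the transformed values; plot(h_log) would draw the histogram.
h_log <- hist(log10(sulphates_demo), breaks = 30, plot = FALSE)
```

A skewness near zero after the transformation is consistent with a lognormal shape, though it does not prove normality on its own.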
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
For quality, the main feature of the dataset, the positive correlation coefficients greater than 0.1 are:
alcohol:quality = 0.48
sulphates:quality = 0.25
citric.acid:quality = 0.23
fixed.acidity:quality = 0.12
So alcohol content has the strongest positive correlation with red wine quality, although at 0.48 it is only moderate. Other attributes positively correlated with quality include sulphates, citric acid and fixed acidity.
For quality, the negative correlation coefficients less than -0.1 are:
volatile.acidity:quality = -0.39
total.sulfur.dioxide:quality = -0.19
density:quality = -0.17
chlorides:quality = -0.13
So volatile acidity is negatively correlated with red wine quality; as the attribute description notes, at too high a level it can lead to an unpleasant, vinegar taste. Total sulfur dioxide, density and chlorides are also negatively correlated with quality.
Besides quality, the attribute pairs with the highest (positive or negative) correlations are:
fixed.acidity:pH = -0.68
fixed.acidity:citric.acid = 0.67
fixed.acidity:density = 0.67
free.sulfur.dioxide:total.sulfur.dioxide = 0.67
volatile.acidity:citric.acid = -0.55
citric.acid:pH = -0.54
density:alcohol = -0.50
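Lists like the ones above can be read off a Pearson correlation matrix. The sketch below shows one way to do that in base R. The helper `strong_pairs` and the synthetic `demo` data frame are hypothetical; the coefficients it prints are not the ones reported in this analysis.

```r
# Hypothetical helper: list variable pairs whose |Pearson r| exceeds a threshold.
strong_pairs <- function(df, threshold = 0.5) {
  r <- cor(df, method = "pearson")
  # Look only above the diagonal so each pair is reported once.
  idx <- which(upper.tri(r) & abs(r) > threshold, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r    = round(r[idx], 2)
  )
  out[order(-abs(out$r)), ]   # strongest relationships first
}

# Synthetic data with one built-in negative relationship, for illustration only.
set.seed(3)
acid <- rnorm(200)
demo <- data.frame(
  fixed.acidity = acid,
  pH            = -0.7 * acid + rnorm(200, sd = 0.7),
  alcohol       = rnorm(200)
)

pairs_found <- strong_pairs(demo, threshold = 0.5)
pairs_found
```

Applied to the wine data frame with a threshold of 0.5, a helper like this would be expected to return the pairs listed above.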
A more acidic solution has a lower pH, so it makes sense that both fixed acidity and citric acid have a strong negative correlation with pH. I will look at several of the other strong correlations in a bit more detail.
As for quality, wines with more alcohol or sulphates tend to be rated higher, while volatile acidity is negatively correlated with quality. It is likely that fresher wines avoid the vinegar taste of acetic acid.
As for citric acid, it is positively correlated with fixed acidity and negatively correlated with volatile acidity. As for density, it is positively correlated with fixed acidity and negatively correlated with alcohol.
Of the variables analyzed, the strongest relationship was between fixed.acidity and pH, with a correlation coefficient of -0.68.
Now let’s visualize the relationship between sulphates, volatile.acidity, alcohol and quality:
Let’s try to summarize quality using a contour plot of volatile acidity and sulphate content:
Let’s try to summarize quality using a contour plot of citric acid and alcohol content:
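The plotting code is not shown in this report. As a rough base-R stand-in for a contour plot, the sketch below averages quality over a grid of two binned features; this is a swapped-in technique, not the author's method. The helper `grid_mean` and all `*_demo` vectors are hypothetical, and the data are synthetic.

```r
# Hypothetical helper: mean of z within each cell of a bins x bins grid over x and y.
grid_mean <- function(x, y, z, bins = 5) {
  tapply(z, list(cut(x, bins), cut(y, bins)), mean)
}

# Synthetic data in which the score rises with both features, for illustration only.
set.seed(4)
n <- 500
alcohol_demo <- runif(n, 8.4, 14.9)
citric_demo  <- runif(n, 0, 1)
quality_demo <- 3 + 0.3 * (alcohol_demo - 8.4) + citric_demo + rnorm(n, sd = 0.5)

m <- grid_mean(alcohol_demo, citric_demo, quality_demo)
round(m, 2)

# image(m) or filled.contour(m) would draw the grid; empty cells appear as NA.
```

Rows of `m` correspond to alcohol bins and columns to citric acid bins, so the direction in which the mean score rises can be read directly from the matrix.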
Based on the multivariate analysis, five features stood out to me: alcohol, sulphates, citric acid, volatile acidity, and quality. Volatile acidity between 0.3 and 0.5 together with sulphates between 0.6 and 0.9 was a strong indicator of a good wine. Also, wines with high alcohol content and higher citric acid are more likely to be good.
Among the relationships between quality and the other 11 attributes, the strongest correlation was between alcohol and quality.
## # A tibble: 6 x 2
## quality n
## <int> <int>
## 1 3 10
## 2 4 53
## 3 5 681
## 4 6 638
## 5 7 199
## 6 8 18
## wqr$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## wqr$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wqr$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wqr$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wqr$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wqr$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
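The per-quality summaries above have the layout that base R's `by()` prints. The sketch below is an assumption about how such output could be generated, not the author's code. It uses only the first ten `alcohol` and `quality` values shown in the `str()` listing, so its numbers differ from the full-data summaries.

```r
# First ten alcohol and quality values, copied from the str() output in this report.
wqr <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Number of wines per quality score.
table(wqr$quality)

# Summary of alcohol within each quality score.
by(wqr$alcohol, wqr$quality, summary)

# The same grouping reduced to one number per score.
median_alcohol <- tapply(wqr$alcohol, wqr$quality, median)
median_alcohol

# boxplot(alcohol ~ quality, data = wqr) would draw one box per quality score.
```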
The box plots for higher quality red wines are clearly shifted upward, meaning they have a higher alcohol content than the lower quality red wines.
Observe that wines with lower sulphates content are typically rated as bad, with alcohol varying between 9% and 12%. Average wines have higher concentrations of sulphates; wines rated 6 tend to have both higher alcohol content and higher sulphates content. Excellent wines are mostly clustered around higher alcohol and higher sulphates content.
This shows that higher quality red wines generally have citric acid between 0.25 and 0.65 and alcohol above 10.5%, whereas lower quality red wines generally have lower alcohol or lower citric acid.
The red wine dataset contains information on 1,599 red wine instances, with 11 attributes and one output attribute. Initially, I tried to get a sense of how each attribute is distributed on its own. All univariate plots were arranged together. Many of the variables look normally distributed. However, chlorides, sulphates, alcohol, free sulfur dioxide and total sulfur dioxide look like they have lognormal distributions, so I excluded values above the 95th percentile for these five features and re-plotted their histograms.
Then, I tried to find which factors might affect the quality of the wine. Here, the Pearson correlation coefficient helps quantify the relationship between each pair of variables. Using the correlation coefficients from the paired plots, it was interesting to explore quality using box plots with a different color for each quality score. Melting the data frame and using facet grids was also really helpful for visualizing the distributions of the parameters with scatter plots. Finally, a contour plot of wine quality over a point plot of volatile acidity and alcohol is a good way to show that lower volatile acidity or higher alcohol makes a better wine more likely. The result makes sense. Volatile acidity, the amount of acetic acid in wine, is mostly caused by bacteria. At too high a level, it can lead to an unpleasant, vinegar taste.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.